[SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12 #22259
Conversation
@cloud-fan you might be interested in this. This is a simpler version that doesn't use udfInternal. This is really just about the nullability issue. It leaves the current behavior that schema inference failure is quietly ignored.

Test build #95387 has finished for PR 22259 at commit

LGTM.

Test build #4297 has finished for PR 22259 at commit
    inputsNullCheck
      .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
      .getOrElse(udf)

case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
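For readers following along, here is a minimal, self-contained sketch of what this rewrite does, using hypothetical stand-in types rather than the actual Catalyst classes: the UDF call is guarded by a null check built only from the inputs whose Scala types are primitive (non-nullable).

```scala
object NullCheckSketch {
  // Hypothetical stand-ins for Catalyst expressions (not Spark's classes).
  sealed trait Expr
  case class Input(name: String) extends Expr
  case class Udf(inputs: Seq[Expr], nullableTypes: Seq[Boolean]) extends Expr
  case class If(predicate: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr
  case object NullLiteral extends Expr

  // Mirrors the diff above: OR together IsNull checks for every input whose
  // type is not nullable, then guard the UDF so a null primitive input
  // produces null instead of a garbage unboxed value.
  def handleNullInputs(udf: Udf): Expr = {
    val checks = udf.inputs.zip(udf.nullableTypes).collect {
      case (input, nullable) if !nullable => IsNull(input): Expr
    }
    checks.reduceOption(Or(_, _))
      .map(If(_, NullLiteral, udf))
      .getOrElse(udf)
  }
}
```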
nit: can we restore the spaces as in the original?
Ah, missed it. We can clean it up later.
I might be missing it: what is the space issue? There's an additional level of indent because of the if statement.
It is in `udf@ScalaUDF`, which should have been `udf @ ScalaUDF`.
Oh right. Yeah I didn't mean to change that. It's minor enough to leave I think. (or else standardize across the code)
Thanks, merging to master!
@@ -47,7 +48,8 @@ case class ScalaUDF(
     inputTypes: Seq[DataType] = Nil,
     udfName: Option[String] = None,
     nullable: Boolean = true,
-    udfDeterministic: Boolean = true)
+    udfDeterministic: Boolean = true,
+    nullableTypes: Seq[Boolean] = Nil)
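For illustration, a hedged sketch of constructing the expression with the new field. The argument names follow the diff above; this is Spark-internal API whose exact shape may differ across versions. For a UDF over `(Int, String)`, the `Int` slot is primitive, so its entry in `nullableTypes` is `false`.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, ScalaUDF}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Assumed constructor shape, per the diff above (internal API).
val expr = ScalaUDF(
  function = (x: Int, s: String) => s * x,
  dataType = StringType,
  children = Seq(Literal(3), Literal("ab")),
  inputTypes = Seq(IntegerType, StringType),
  nullableTypes = Seq(false, true) // false = primitive slot, needs a null check
)
```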
Using Nil as the default value is dangerous. We don't even have an assert to ensure it is set, so we could easily miss setting it with the right values.
We put the assert in the rule `HandleNullInputsForUDF`. Can we merge `inputTypes` and `nullableTypes` here so that we don't need to worry about it any more? cc @srowen
The logic here was again that we wanted to avoid changing the binary signature. I know catalyst is effectively private to Spark, but this wasn't marked specifically private; I wondered if it would actually affect callers? If not we can go back and merge it.
Nil is just an empty list; I don't think it's dangerous, and it is used above in `inputTypes`. It is not always set, because it's not always possible to infer the schema, let alone nullability.
I think the problem is more about the way we handle `nullableTypes` when it is not specified, as in https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2157. The test failure in https://github.com/apache/spark/pull/21851/files#diff-e8dddba2915a147970654aa93bee30a7R344 would have been exposed if `nullableTypes` had been updated in this PR. So I would say logically this parameter is required, but right now it is declared optional. In this case, things went wrong when `nullableTypes` was left unspecified, and this could happen not only in tests but in "source" too. I suggest we move this parameter up, right after `inputTypes`, so it can get the attention it needs.
I see, you are saying that some UDF needed to declare nullable types but didn't? I made the param optional to try to make migration easier and avoid changing the signature much. But doesn't the test you point to pass? Are you saying it should not pass?
Yeah, I could see an argument that it need not block the release. The functionality works as intended, at least.
Would you change it again in 2.4.1? If not, then we decide to just keep this behavior. Let's at least get this in if there is a new RC.
I think it's mostly about maintainability. We should definitely update the code as @maryannxue said, so that `ScalaUDF` is easier to use and less error-prone. I feel we don't need to backport it, as it's basically a code refactor.
@maryannxue Please submit a PR to make the parameter `nullableTypes` required.
Yeah looks like we should just make these changes after all, and for 2.4, as we need another RC.
## What changes were proposed in this pull request?

This is a followup of #22259. A Scala case class has a wide surface: apply, unapply, accessors, copy, etc. In #22259, we changed the type of `UserDefinedFunction.inputTypes` from `Option[Seq[DataType]]` to `Option[Seq[Schema]]`. This breaks backward compatibility. This PR changes the type back and uses a `var` to keep the new nullable info.

## How was this patch tested?

N/A

Closes #22319 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
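A hedged sketch of the `var` trick described in the commit message above (names and field types are simplified assumptions, not the exact Spark code): the case class keeps its original constructor, so `apply`, `unapply`, and `copy` retain their binary signatures, while the new nullability info rides along in a mutable field set after construction.

```scala
// Simplified stand-in for UserDefinedFunction (hypothetical field types).
case class UserDefinedFunctionSketch(
    f: AnyRef,
    inputTypes: Option[Seq[String]]) {
  // New info lives outside the constructor, invisible to apply/unapply/copy,
  // so the case class's public surface is unchanged.
  var nullableTypes: Option[Seq[Boolean]] = None
}

// Construct as before, then attach the nullability info separately.
val udf0 = UserDefinedFunctionSketch((x: Int) => x + 1, Some(Seq("int")))
udf0.nullableTypes = Some(Seq(false))
```

One trade-off of this pattern: `copy` creates a new instance from the constructor arguments only, so the `var` does not carry over automatically.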
## What changes were proposed in this pull request?

This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should indeed be required: otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

#22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored to the original behavior. The only exception is documented in the Spark SQL migration guide.

In addition, now that we have this extra field indicating whether a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied repeatedly.

## How was this patch tested?

Added a UT in UDFSuite. Passed affected existing UTs: AnalysisSuite, UDFSuite.

Closes #22732 from maryannxue/spark-25044-followup.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e816776)
Signed-off-by: Wenchen Fan <[email protected]>
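The behavior difference called out in the commit message above can be illustrated end to end. This is a minimal sketch (session setup, app name, and column names are assumptions, not from the PR): with the nullable-type info recorded at registration, a null fed to a primitive `Int` parameter yields `null` rather than being unboxed to `0`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfNullDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("udf-null-demo")
    .getOrCreate()
  import spark.implicits._

  // UDF over a primitive Int: Spark should guard the call with a null check.
  val plusOne = udf((x: Int) => x + 1)

  // One non-null and one null INT value.
  val df = spark.range(2)
    .selectExpr("IF(id = 0, CAST(1 AS INT), CAST(NULL AS INT)) AS x")

  // Expected: 2 for the non-null row, null for the null row (not 1).
  df.select(plusOne($"x").as("y")).show()

  spark.stop()
}
```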
What changes were proposed in this pull request?
Alternative take on #22063 that does not introduce `udfInternal`.
Resolves the issue with inferring function types in Scala 2.12 by instead using information captured when the UDF is registered, namely which argument types are nullable (i.e. not primitive).
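A minimal sketch of the idea (a simplification; Spark's actual implementation goes through `ScalaReflection` schema inference): at registration the parameter types are still statically known, so we can record which ones are primitives before the closure is erased, in 2.12, to a lambda-metafactory function taking `Object` arguments.

```scala
import scala.reflect.runtime.universe._

// A type is treated as nullable unless it is an AnyVal (primitive-like).
def isNullable[T: TypeTag]: Boolean = !(typeOf[T] <:< typeOf[AnyVal])

// Hypothetical 2-arg registration helper recording per-argument nullability.
def nullableTypesOf[A1: TypeTag, A2: TypeTag, R](f: (A1, A2) => R): Seq[Boolean] =
  Seq(isNullable[A1], isNullable[A2])

// Example: Int is primitive (not nullable), String is nullable.
val result = nullableTypesOf((x: Int, s: String) => s * x)
// result == Seq(false, true)
```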
How was this patch tested?
Existing tests.